Andres Delgadillo

Project Ensemble Techniques

1 Project: Travel Package Purchase Prediction

1.1 Objective

To predict which customers are more likely to purchase the newly introduced travel package.

1.2 Data Dictionary

Customer details:

Customer interaction data:

2 Import packages and turn off warnings

3 Import dataset and quality of data

4 Exploratory Data Analysis

4.1 Pandas profiling report

We can get a first statistical and descriptive analysis using pandas_profiling

The pandas_profiling report shows some warnings/characteristics in the data:

4.2 Univariate Analysis

4.3 Pairplot

We are going to perform bivariate analysis to understand the relationship between the columns

4.4 Bivariate and Multivariate Analysis

4.4.1 TypeofContact and ProdTaken

4.4.2 CityTier and ProdTaken

4.4.3 Occupation and ProdTaken

4.4.4 Gender and ProdTaken

4.4.5 NumberOfPersonVisiting and ProdTaken

4.4.6 NumberOfFollowups and ProdTaken

4.4.7 Designation and ProdTaken

4.4.8 PreferredPropertyStar and ProdTaken

4.4.9 MaritalStatus and ProdTaken

4.4.10 Passport and ProdTaken

4.4.11 PitchSatisfactionScore and ProdTaken

4.4.12 OwnCar and ProdTaken

4.4.13 NumberOfChildrenVisiting and ProdTaken

4.4.14 Age and ProdTaken

4.4.15 DurationOfPitch and ProdTaken

4.4.16 NumberOfTrips and ProdTaken

4.4.17 MonthlyIncome and ProdTaken

4.5 Insights based on EDA

From the previous Exploratory Data Analysis, we can determine some characteristics of the customers that bought a package:

4.6 Customer Profile of different products

Now, we are going to analyze the characteristics of the customers who bought a package, separately for each product

4.6.1 Designation and ProductPitched

4.6.2 TypeofContact and ProductPitched

4.6.3 CityTier and ProductPitched

4.6.4 Occupation and ProductPitched

4.6.5 Gender and ProductPitched

4.6.6 ProductPitched and NumberOfPersonVisiting

4.6.7 ProductPitched and NumberOfFollowups

4.6.8 ProductPitched and PreferredPropertyStar

4.6.9 ProductPitched and MaritalStatus

4.6.10 ProductPitched and Passport

4.6.11 ProductPitched and PitchSatisfactionScore

4.6.12 ProductPitched and OwnCar

4.6.13 ProductPitched and NumberOfChildrenVisiting

4.6.14 ProductPitched and Age

4.6.15 DurationOfPitch and ProductPitched

4.6.16 NumberOfTrips and ProductPitched

4.6.17 MonthlyIncome and ProductPitched

4.6.18 Insights for Customer Profile

These are the main characteristics of the customers that bought the different packages

5 Data Pre-Processing

5.1 Feature Engineering

5.2 Missing value treatment

5.2.1 Missing value analysis

Among the columns with missing data, TypeofContact is categorical, while DurationOfPitch, MonthlyIncome, Age, NumberOfTrips, NumberOfChildrenVisiting, NumberOfFollowups and PreferredPropertyStar are numerical

We are going to analyze whether there is a pattern in the 25 rows with 3 missing values

For these 25 rows, the 3 missing columns are: TypeofContact, DurationOfPitch and MonthlyIncome
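The row-level check described above can be sketched as follows. This is a minimal illustration on a toy DataFrame, not the notebook's original code: the values are made up, and only the column names come from the data dictionary.

```python
import numpy as np
import pandas as pd

# Toy data, purely illustrative; only the column names follow the dataset.
df = pd.DataFrame({
    "TypeofContact": ["Self Enquiry", None, "Company Invited", None],
    "DurationOfPitch": [6.0, np.nan, 14.0, np.nan],
    "MonthlyIncome": [20993.0, np.nan, 17090.0, np.nan],
    "Age": [41.0, 49.0, np.nan, 33.0],
})

# Number of missing values in each row.
missing_per_row = df.isna().sum(axis=1)

# Keep only the rows with exactly three missing values.
rows_with_three = df[missing_per_row == 3]

# For those rows, count how often each column is missing and
# list the columns that are missing at least once.
missing_pattern = rows_with_three.isna().sum()
missing_columns = missing_pattern[missing_pattern > 0].index.tolist()
print(missing_columns)
```

If the same columns are missing in every such row, as reported above, the missingness is probably not random and may be worth imputing as a group.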

Now, we are going to get the columns with missing values

5.2.2 Missing value imputation

5.3 Outliers detection

We are going to analyze the outliers in DurationOfPitch, MonthlyIncome and NumberOfTrips columns
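The text does not state which rule was used to flag outliers. A common choice, assumed here only for illustration, is the 1.5 x IQR boxplot rule. The helper name `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up pitch durations with one extreme value.
duration = pd.Series([6, 8, 9, 10, 12, 14, 15, 16, 127], name="DurationOfPitch")
mask = iqr_outlier_mask(duration)
print(duration[mask].tolist())
```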

5.3.1 DurationOfPitch

DurationOfPitch has several outliers. All values above 37 are going to be clipped to 37

5.3.2 MonthlyIncome

MonthlyIncome has several outliers. Values below 15000 and above 40000 are going to be clipped to those limits
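Clipping at the two thresholds can be done with `Series.clip`. The sketch below uses made-up income values; only the limits 15000 and 40000 come from the text.

```python
import pandas as pd

# Made-up incomes, two of them outside the [15000, 40000] range.
income = pd.Series([1000.0, 16000.0, 25000.0, 39000.0, 98678.0], name="MonthlyIncome")

# Values below 15000 become 15000; values above 40000 become 40000.
income_clipped = income.clip(lower=15000, upper=40000)
print(income_clipped.tolist())
```

The same call with only `upper=37` would cover the DurationOfPitch case.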

5.3.3 NumberOfTrips

NumberOfTrips has values above 10. However, it is possible for a customer to have more than 10 trips in a year. Therefore, we are not going to modify these values

5.4 Data Preparation

5.4.1 Creating training and test data sets

Both the training set and the test set have similar class ratios for ProdTaken
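One way to obtain matching class ratios is a stratified split. The sketch below assumes `train_test_split` with `stratify=y` and a 70/30 split; neither the split size nor the use of stratification is stated in the text, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 customers, 19% of whom took the product.
rng = np.random.default_rng(1)
X = pd.DataFrame({"Age": rng.integers(18, 61, size=1000)})
y = pd.Series(np.r_[np.ones(190, dtype=int), np.zeros(810, dtype=int)], name="ProdTaken")

# stratify=y keeps the class proportions the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Compare the class ratios of the two subsets.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```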

6 Models evaluation criteria

6.1 Insights:

6.1.1 The model can make wrong predictions by:

  1. Predicting that a customer will purchase a package when the customer actually does not (false positive)
  2. Predicting that a customer will not purchase a package when the customer actually does (false negative)
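The two error types map directly onto the cells of a confusion matrix. The sketch below uses made-up labels to show how they can be counted, and how precision and recall follow from them.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = purchased a package, 0 = did not purchase.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# With labels=[0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("False positives (case 1):", fp)
print("False negatives (case 2):", fn)

# Precision penalizes false positives; recall penalizes false negatives.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
print(round(precision, 3), round(recall, 3))
```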

6.1.2 Which case is more important?

6.1.3 How to reduce this loss?

6.2 Functions to evaluate models

7 Model building - Bagging

7.1 Decision tree model

7.2 Bagging classifier

7.2.1 Bagging

7.2.2 Bagging Classifier with weighted decision tree

7.3 Random Forest

7.3.1 Random Forest

7.3.2 Random forest with class weights

7.4 Model Performance Summary

8 Model performance improvement - Bagging

We are going to use the F1-score as the performance metric, with the goal of improving recall without reducing precision considerably
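A minimal sketch of tuning against the F1-score, assuming `GridSearchCV` as the search method (the text does not say which search was used). The parameter grid and the synthetic data are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared training data.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=1
)

# Hypothetical grid; the values are illustrative only.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

# scoring="f1" ranks each candidate by the F1-score of the positive class.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because F1 is the harmonic mean of precision and recall, a candidate that raises recall by sacrificing most of its precision does not score well.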

8.1 Tuning Decision Tree

Now we are going to improve the Decision Tree and reduce its complexity with cost complexity pruning, by identifying the optimal ccp_alpha parameter

We are going to train a decision tree using the effective alphas

Now, we are going to analyze how the number of nodes and the depth of the tree decrease as alpha increases
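The pruning steps above can be sketched with scikit-learn's `cost_complexity_pruning_path`. The data here is synthetic, and choosing the final ccp_alpha (for example, by F1-score on held-out data) is left out.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Effective alphas at which the fully grown tree loses subtrees.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)

# Guard against tiny negative values caused by floating-point rounding.
ccp_alphas = path.ccp_alphas.clip(min=0)

# Train one tree per effective alpha.
trees = [
    DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)
    for alpha in ccp_alphas
]

# Tree size for each alpha: both numbers shrink as alpha grows.
node_counts = [tree.tree_.node_count for tree in trees]
depths = [tree.get_depth() for tree in trees]

for alpha, nodes, depth in zip(ccp_alphas, node_counts, depths):
    print(f"alpha={alpha:.5f}  nodes={nodes}  depth={depth}")
```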

8.2 Tuning Bagging Classifier

8.3 Tuning Random Forest

8.4 Model Performance Summary

8.5 Feature importance of Tuned Decision Tree

MonthlyIncome, Age and DurationOfPitch are the most important features in the Tuned Decision Tree
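A ranking like this can be read from the fitted tree's `feature_importances_` attribute. The sketch below fits a small tree on synthetic data and reuses four column names from the dataset as labels, so the resulting order is not the notebook's result.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Column names borrowed from the dataset; the values are synthetic.
feature_names = ["MonthlyIncome", "Age", "DurationOfPitch", "NumberOfTrips"]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
X = pd.DataFrame(X, columns=feature_names)

tree = DecisionTreeClassifier(max_depth=4, random_state=1).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```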

9 Model building - Boosting

9.1 AdaBoost Classifier

9.2 Gradient Boosting Classifier

9.3 XGBoost Classifier

9.4 Stacking Classifier

Now, we are going to create a Stacking Classifier using the Tuned Decision Tree, AdaBoost Classifier, Tuned Random Forest and the XGBoost Classifier
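A minimal sketch of such a stacking setup with scikit-learn's `StackingClassifier`. Several parts are assumptions: the data is synthetic, the hyperparameters are placeholders rather than the tuned values, `GradientBoostingClassifier` stands in for the XGBoost classifier so the example depends only on scikit-learn, and the logistic regression final estimator is not taken from the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the prepared data.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1
)

# Base learners; hyperparameters are placeholders, not the tuned values.
estimators = [
    ("decision_tree", DecisionTreeClassifier(max_depth=4, random_state=1)),
    ("adaboost", AdaBoostClassifier(n_estimators=50, random_state=1)),
    ("random_forest", RandomForestClassifier(n_estimators=50, random_state=1)),
    # Stand-in for the XGBoost classifier (assumption, see lead-in).
    ("gradient_boosting", GradientBoostingClassifier(n_estimators=50, random_state=1)),
]

# The final estimator learns from cross-validated base-model predictions.
stacking = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(),
    cv=5,
)
stacking.fit(X_train, y_train)

test_f1 = f1_score(y_test, stacking.predict(X_test))
print(round(test_f1, 3))
```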

9.5 Model Performance Summary

10 Model performance improvement - Boosting

We are going to use the F1-score as the performance metric, with the goal of improving recall without reducing precision considerably

10.1 Tuned AdaBoost Classifier

10.2 Tuned Gradient Boosting Classifier

10.3 Tuned XGBoost Classifier

10.4 Tuned Stacking Classifier

Now, we are going to create a Tuned Stacking Classifier using the Tuned Decision Tree, Tuned AdaBoost Classifier, Tuned Random Forest, Tuned Gradient Boosting Classifier, and Tuned XGBoost Classifier

10.5 Model Performance Summary

10.6 Feature importance of Tuned XGBoost Classifier

Passport, MaritalStatus_Single and ProductPitched_Super Deluxe are the most important features in the Tuned XGBoost Classifier

11 Actionable Insights and Recommendations